ARACHNE: a whole-genome shotgun assembler.

Authors

  • Serafim Batzoglou
  • David B Jaffe
  • Ken Stanley
  • Jonathan Butler
  • Sante Gnerre
  • Evan Mauceli
  • Bonnie Berger
  • Jill P Mesirov
  • Eric S Lander
Abstract

We describe a new computer system, called ARACHNE, for assembling genome sequence using paired-end whole-genome shotgun reads. ARACHNE has several key features, including an efficient and sensitive procedure for finding read overlaps, a procedure for scoring overlaps that achieves high accuracy by correcting errors before assembly, read merger based on forward-reverse links, and detection of repeat contigs by forward-reverse link inconsistency. To test ARACHNE, we created simulated reads providing approximately 10-fold coverage of the genomes of H. influenzae, S. cerevisiae, and D. melanogaster, as well as human chromosomes 21 and 22. The assemblies of these simulated reads yielded nearly complete coverage of the respective genomes, with a small number of contigs joined into a smaller number of supercontigs (or scaffolds). For example, analysis of the D. melanogaster genome yielded approximately 98% coverage with an N50 contig length of 324 kb and an N50 supercontig length of 5143 kb. The assembly accuracy was high, although not perfect: small errors occurred at a frequency of roughly 1 per 1 Mb (typically, deletion of approximately 1 kb in size), with a very small number of other misassemblies. The assembly was rapid: the Drosophila assembly required only 21 hours on a single 667 MHz processor and used 8.4 Gb of memory.
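
The N50 figures quoted in the abstract can be made concrete with a small calculation. The sketch below assumes the common definition of N50 (the largest length L such that contigs of length at least L together contain at least half of all assembled bases); the abstract does not spell out the exact definition used, and the function name `n50` and the example lengths are illustrative only, not taken from ARACHNE.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or supercontig) lengths.

    Assumed definition: the largest length L such that sequences of
    length >= L account for at least half of the total bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Illustrative contig lengths in kb (made-up numbers, not ARACHNE output).
example_lengths = [2, 3, 4, 5, 6]
print(n50(example_lengths))
```

The same function applies unchanged to supercontig lengths, which is how a single assembly can report both an N50 contig length and a much larger N50 supercontig length.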

Similar resources

Improving Genome Assembly without Sequencing

Assembly of genomes from whole-genome sequencing (WGS) projects is one of the most complex computational problems in genomics. WGS assemblers such as Arachne [1] and Celera Assembler [2] are able to process data from millions of individual sequence "reads" and construct an accurate representation of a genome. These assemblies are in the form of contigs (contiguous stretches of DNA sequence) and...

Supporting Text

Genome Sequencing and Assembly. Initial shotgun libraries were generated and sequenced at the Broad by the Microbial Sequencing Center yielding 76,452 (PA2192) and 77,884 (C3719) sequences (paired-reads). The reads were assembled using ARACHNE (1, 2). After refinement, final assemblies contained 82 (PA2192) and 124 (C3719) contigs with a total sequence spanning single scaffolds of 6.83 Mb (PA21...

Design of a compartmentalized shotgun assembler for the human genome

Two different strategies for determining the human genome are currently being pursued: one is the "clone-by-clone" approach, employed by the publicly funded project, and the other is the "whole genome shotgun assembler" approach, favored by researchers at Celera Genomics. An interim strategy employed at Celera, called compartmentalized shotgun assembly, makes use of preliminary data produced by...

Hapsembler: An Assembler for Highly Polymorphic Genomes

As whole genome sequencing has become a routine biological experiment, algorithms for assembly of whole genome shotgun data have become a topic of extensive research, with a plethora of off-the-shelf methods that can reconstruct the genomes of many organisms. Simultaneously, several recently sequenced genomes exhibit very high polymorphism rates. For these organisms genome assembly remains a cha...

Aggressive assembly of pyrosequencing reads with mates

MOTIVATION: DNA sequence reads from Sanger and pyrosequencing platforms differ in cost, accuracy, typical coverage, average read length and the variety of available paired-end protocols. Both read types can complement one another in a 'hybrid' approach to whole-genome shotgun sequencing projects, but assembly software must be modified to accommodate their different characteristics. This is true ...

Journal:
  • Genome Research

Volume 12, Issue 1

Pages: -

Publication date: 2002